Abstract

We reduce phrase-representation parsing to dependency parsing. Our
reduction is grounded on a new intermediate representation,
“head-ordered dependency trees,” shown to be isomorphic to constituent
trees. By encoding order information in the dependency labels, we show
that any off-the-shelf, trainable dependency parser can be used to
produce constituents. When this parser is non-projective, we can
perform discontinuous parsing in a very natural manner. Despite the
simplicity of our approach, experiments show that the resulting parsers
are on par with strong baselines, such as the Berkeley parser for
English and the best single system in the SPMRL-2014 shared task.
Results are particularly striking for discontinuous parsing of German,
where we surpass the current state of the art by a wide margin.

1 Introduction 

Constituent parsing is a central problem in NLP, one at which
statistical models trained on treebanks have excelled (Charniak, 1996;
Klein and Manning, 2003; Petrov and Klein, 2007). However, most
existing parsers are slow, since they need to deal with a heavy grammar
constant. Dependency parsers are generally faster, but less
informative, since they do not produce constituents, which are often
required by downstream applications (Johansson and Nugues, 2008; Wu et
al., 2009; Berg-Kirkpatrick et al., 2011; Elming et al., 2013). How can
we get the best of both worlds?

Coarse-to-fine decoding (Charniak and Johnson, 2005) and shift-reduce
parsing (Sagae and Lavie, 2005; Zhu et al., 2013) were a step forward
to accelerate constituent parsing, but runtimes still lag those of
dependency parsers. This is only made worse if discontinuous
constituents are allowed; such discontinuities are convenient to
represent wh-movement, scrambling, extraposition, and other linguistic
phenomena common in free word order languages. While non-projective
dependency parsers, which are able to model such phenomena, have been
widely developed in the last decade (Nivre et al., 2007; McDonald et
al., 2006; Martins et al., 2013), discontinuous constituent parsing is
still taking its first steps (Maier and Søgaard, 2008; Kallmeyer and
Maier, 2013).

* This research was carried out during an internship at

In this paper, we show that an off-the-shelf, trainable dependency
parser is enough to build a highly competitive constituent parser. This
(surprising) result is based on a reduction of constituent to
dependency parsing, followed by a simple post-processing procedure to
recover unaries. Unlike other constituent parsers, ours requires
neither estimating a grammar nor binarizing the treebank. Moreover,
when the dependency parser is non-projective, our method can perform
discontinuous constituent parsing in a very natural way.

Key to our approach is the notion of head-ordered dependency trees
(shown in Figure 1): by endowing dependency trees with this additional
layer of structure, we show that they become isomorphic to constituent
trees. We encode this structure as part of the dependency labels,
enabling a dependency-to-constituent conversion. Hall and Nivre (2008)
attempted a related conversion to parse German, but their complex
encoding scheme blows up the number of arc labels, affecting the final
parser’s quality. By contrast, our light encoding achieves a 10-fold
decrease in the number of labels, translating into more accurate
parsing.

While simple, our reduction-based parsers are on par with the Berkeley
parser for English (Petrov and Klein, 2007), and with the best single
system in the recent SPMRL shared task (Seddah et al., 2014), for eight
morphologically rich languages. For discontinuous parsing, we surpass
the current state of the art by a wide margin on two German datasets
(TIGER and NEGRA), while achieving fast parsing speeds. Our parsers
will be released along with this paper as accompanying software.

2 Background 

We start by reviewing constituent and dependency representations, and
setting up the notation. Following Kong and Smith (2014), we use c-/d-
prefixes for convenience (e.g., we write c-parser for constituent
parser and d-tree for dependency tree).

2.1 Constituent Trees 

Constituent-based representations are commonly seen as derivations
according to a context-free grammar (CFG). Here, we focus on properties
of the c-trees, rather than of the grammars used to generate them. We
consider a broad scenario that permits c-trees with discontinuities,
such as the ones derived with linear context-free rewriting systems
(LCFRS; Vijay-Shanker et al. (1987)). We also assume that the c-trees
are lexicalized.

Formally, let $w_1 w_2 \ldots w_L$ be a sentence, where $w_i$ denotes
the word in the $i$th position. A c-tree is a rooted tree whose leaves
are the words $\{w_i\}_{i=1}^{L}$, and whose internal nodes
(constituents) are represented as a tuple
$\langle Z, h, \mathcal{I} \rangle$, where $Z$ is a non-terminal
symbol, $h \in \{1, \ldots, L\}$ indicates the lexical head, and
$\mathcal{I} \subseteq \{1, \ldots, L\}$ is the node’s yield. Each
word’s parent is a pre-terminal unary node of the form
$\langle p_i, i, \{i\} \rangle$, where $p_i$ denotes the word’s
part-of-speech (POS) tag. The yields and lexical heads are defined so
that for every constituent $\langle Z, h, \mathcal{I} \rangle$ with
children $\{\langle X_k, m_k, \mathcal{J}_k \rangle\}_{k=1}^{K}$,
(i) we have $\mathcal{I} = \bigcup_{k=1}^{K} \mathcal{J}_k$; and
(ii) there is a unique $k$ such that $h = m_k$. This $k$th node (called
the head-child node) is commonly chosen by applying a handwritten set
of head rules (Collins, 1999; Yamada and Matsumoto, 2003).

A c-tree is continuous if all nodes $\langle Z, h, \mathcal{I} \rangle$
have a contiguous yield $\mathcal{I}$, and discontinuous otherwise.
Trees derived from a CFG are always continuous; those derived by an
LCFRS may have discontinuities, the yield of a node being a union of
spans, possibly with gaps in the middle. Figure 1 shows an example of a
continuous and a discontinuous c-tree. Discontinuous c-trees have
crossing branches, if the leaves are drawn in left-to-right surface
order. An internal node which is not a pre-terminal is called a proper
node. A node is called unary if it has exactly one child. A c-tree
without unary proper nodes is called unaryless. If all proper nodes
have exactly two children then it is called a binary c-tree. Continuous
binary trees may be regarded as having been generated by a CFG in
Chomsky normal form.
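The definitions above can be made concrete with a small data structure. The following is only a sketch under our own assumptions: the class `CNode` and the helper names are hypothetical and not part of the paper's software. A pre-terminal is represented as a node without children whose head is the position of its word.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Set


@dataclass
class CNode:
    """A lexicalized constituent <Z, h, I> (illustrative names only)."""
    label: str                      # non-terminal Z, or the POS tag for a pre-terminal
    head: int                       # 1-based position h of the lexical head
    children: List["CNode"] = field(default_factory=list)

    @property
    def is_preterminal(self) -> bool:
        return not self.children    # its only child would be the word itself

    def yield_(self) -> Set[int]:
        """Yield I: the union of the children's yields (condition (i))."""
        if self.is_preterminal:
            return {self.head}
        positions: Set[int] = set()
        for child in self.children:
            positions |= child.yield_()
        return positions

    def proper_nodes(self) -> Iterator["CNode"]:
        """All internal nodes that are not pre-terminals."""
        if not self.is_preterminal:
            yield self
            for child in self.children:
                yield from child.proper_nodes()


def has_unique_head_child(node: CNode) -> bool:
    """Condition (ii): exactly one child shares the node's lexical head."""
    return sum(c.head == node.head for c in node.children) == 1


def is_continuous(root: CNode) -> bool:
    def contiguous(s: Set[int]) -> bool:
        return max(s) - min(s) + 1 == len(s)
    return all(contiguous(n.yield_()) for n in root.proper_nodes())


def is_unaryless(root: CNode) -> bool:
    return all(len(n.children) != 1 for n in root.proper_nodes())


def is_binary(root: CNode) -> bool:
    return all(len(n.children) == 2 for n in root.proper_nodes())


# Flat VP over "really needs caution" (tags RB VBZ NN), headed by "needs".
vp = CNode("VP", 2, [CNode("RB", 1), CNode("VBZ", 2), CNode("NN", 3)])
# A toy discontinuous tree: node Y covers positions 1 and 3 but not 2.
gap = CNode("X", 2, [CNode("Y", 3, [CNode("a", 1), CNode("b", 3)]), CNode("c", 2)])
```

Here `vp` is continuous and unaryless but not binary, while `gap` is binary but discontinuous, since the yield of `Y` has a hole at position 2.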

Prior work. There has been a long string of work in statistical
c-parsing, shifting from simple models (Charniak, 1996) to more
sophisticated ones using structural annotation (Johnson, 1998; Klein
and Manning, 2003), latent grammars (Matsuzaki et al., 2005; Petrov and
Klein, 2007), and lexicalization (Eisner, 1996; Collins, 1999). An
orthogonal line of work uses ensemble or reranking strategies to
further improve accuracy (Charniak and Johnson, 2005; Huang, 2008;
Björkelund et al., 2014). Discontinuous c-parsing is considered a much
harder problem, involving mildly context-sensitive formalisms such as
LCFRS or range concatenation grammars, with treebank-derived c-parsers
exhibiting near-exponential runtime (Kallmeyer and Maier, 2013,
Figure 27). To speed up decoding, prior work has considered
restrictions, such as bounding the fan-out (Maier et al., 2012) and
requiring well-nestedness (Kuhlmann and Nivre, 2006; Gómez-Rodríguez et
al., 2010). Other approaches eliminate the discontinuities via tree
transformations (Boyd, 2007; Kübler et al., 2008), sometimes as a
pruning step followed by reranking (van Cranenburgh and Bod, 2013).
However, reported runtimes still exceed 10 seconds per sentence, which
is not practical. Recently, Versley (2014a) proposed an easy-first
approach that leads to considerable speed-ups, but is less accurate. In
this paper, we design fast discontinuous c-parsers that outperform all
the ones above by a wide margin, with runtimes similar to those of
Versley (2014a).

2.2 Dependency Trees 

In this paper, we use d-parsers as a black box to parse constituents.
Given a sentence $w_1 \ldots w_L$, a d-tree is a directed tree spanning
all the words in the sentence. 1 Each arc in this tree is a tuple
$\langle h, m, \ell \rangle$, expressing a typed dependency relation
$\ell$ between the head word $w_h$ and the modifier $w_m$.

1 We assume throughout that dependency trees have a single root among
$\{w_1, \ldots, w_L\}$. Therefore, there is no need to consider an
extra root symbol, as often done in the literature.

Figure 1: Top: a continuous (left) and a discontinuous (right) c-tree,
taken from English PTB §22 and German NEGRA, respectively. Head-child
nodes are in bold. Bottom: corresponding head-ordered d-trees. The
indices #1, #2, etc. denote the order of attachment events for each
head. The English unary nodes ADVP and ADJP are dropped in the
conversion.

Figure 2: Three different c-structures for the VP “really needs
caution.” All are consistent with the d-structure at the top left.

A d-tree is projective if for every arc $\langle h, m, \ell \rangle$
there is a directed path from $h$ to all words that lie between $h$ and
$m$ in the surface string (Kahane et al., 1998). Projective d-trees can
be obtained from continuous c-trees by reading off the lexical heads
and dropping the internal nodes (Gaifman, 1965). However, this relation
is many-to-one: as shown in Figure 2, several c-trees may project onto
the same d-tree, differing on their flatness and on left or
right-branching decisions. In the next section, we introduce the
concept of head-ordered d-trees and express one-to-one mappings between
these two representations.
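The projectivity condition can be tested directly from its definition. The sketch below assumes a d-tree stored as a plain list of head positions; that representation and the function name `is_projective` are our own choices for illustration. It also assumes the input is a valid single-rooted tree, so no cycle detection is attempted.

```python
def is_projective(heads):
    """Check projectivity of a single-rooted d-tree.

    heads[m] is the head position of word m (1-based); the root has
    head 0, and heads[0] is an unused placeholder.
    """
    def descends_from(k, h):
        # Follow head links upward from k until we reach h or the root.
        while k != 0:
            k = heads[k]
            if k == h:
                return True
        return False

    for m in range(1, len(heads)):
        h = heads[m]
        if h == 0:
            continue  # the root has no incoming arc
        lo, hi = sorted((h, m))
        # Every word strictly between h and m must be reachable from h.
        if not all(descends_from(k, h) for k in range(lo + 1, hi)):
            return False
    return True


# "really needs caution": both other words modify "needs" (position 2).
projective_example = [0, 2, 0, 2]
# Word 1 attaches to word 3 across the root (word 2), which does not
# descend from word 3, so this tree is non-projective.
nonprojective_example = [0, 3, 0, 2, 3]
```

This direct check takes quadratic time in the worst case, which is enough for illustration; faster tests exist but are not needed here.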

Prior work. There has been a considerable amount of work developing
rich-feature d-parsers. While projective d-parsers can use dynamic
programming (Eisner and Satta, 1999; Koo and Collins, 2010),
non-projective d-parsers typically rely on approximate decoders, since
the underlying problem is NP-hard beyond arc-factored models (McDonald
and Satta, 2007). An alternative is offered by transition-based
d-parsers (Nivre et al., 2006; Zhang and Nivre, 2011), which achieve
observed linear time. Since d-parsing algorithms do not have a grammar
constant, typical implementations are significantly faster than
c-parsers (Rush and Petrov, 2012; Martins et al., 2013). The key
contribution of this paper is to reduce c-parsing to d-parsing, which
brings these runtimes closer together.

3 Head-Ordered Dependency Trees 

We next endow d-trees with another layer of structure, namely order
information. In this framework, not all modifiers of a head are “born
equal.” Instead, their attachment to the head occurs as a sequence of
“events,” which reflect the head’s preference for attaching some
modifiers before others. As we will see, this additional structure will
undo the ambiguity expressed in Figure 2.

3.1 Strictly Ordered Dependency Trees 

Let us start with the simpler case where the attachment order is
strict. For each head word $h$ with modifiers
$M_h = \{m_1, \ldots, m_K\}$, we endow $M_h$ with a strict order
relation $\prec_h$, so we can organize all the modifiers of $h$ as a
chain, $m_{i_1} \prec_h m_{i_2} \prec_h \ldots \prec_h m_{i_K}$. We
regard this chain as reflecting the order by which words are attached
(i.e., if $m_i \prec_h m_j$ this means that “$m_i$ is attached to $h$
before $m_j$”). We represent this graphically by decorating d-arcs with
indices (#1, #2, ...) to denote the order of events, as we do in
Figure 1.

Figure 3: Transformation of a strictly-ordered d-tree into a binary
c-tree.

A d-tree endowed with a strict order for each head is called a strictly
ordered d-tree. We establish below a correspondence between strictly
ordered d-trees and binary c-trees. Before doing so, we need a few more
definitions about c-trees. For each word position
$h \in \{1, \ldots, L\}$, we define $\psi(h)$ as the highest node in
the c-tree whose lexical head is $h$. We call the path from $\psi(h)$
down to the pre-terminal $p_h$ the spine of $h$. We may regard a c-tree
as a set of $L$ spines, one per word, which attach to each other to
form a tree (Carreras et al., 2008). We then have the following result.

Proposition 1. Binary c-trees and strictly-ordered 
d-trees are isomorphic, i.e., there is a one-to-one 
correspondence between the two sets, where the 
number of symbols is preserved. 

Proof. We use the construction in Figure 3. We will show that, given an
arbitrary strictly-ordered d-tree $\mathcal{D}$, we can perform an
invertible transformation to turn it into a binary c-tree
$\mathcal{C}$; and vice-versa. Let $\mathcal{D}$ be given. We visit
each node $h \in \{1, \ldots, L\}$ and split it into $K + 1$ nodes,
where $K = |M_h|$, organized as a linked list, as Figure 3 illustrates
(this will become the spine of $h$ in the c-tree). For each modifier
$m_k \in M_h$ with $m_1 \prec_h \ldots \prec_h m_K$, move the tail of
the arc $\langle h, m_k, Z_k \rangle$ to the $(K + 1 - k)$th node of
the linked list and assign the label $Z_k$ to this node, letting $h$ be
its lexical head. Since the incoming and outgoing arcs of the linked
list component are the same as in the original node $h$, the tree
structure is preserved. After doing this for every $h$, add the leaves
and propagate the yields bottom up. It is straightforward to show that
this procedure yields a valid binary c-tree. Since there is no loss of
information (the orders $\prec_h$ are implied by the order of the nodes
in each spine), this construction can be inverted to recover the
original d-tree. Conversely, if we start with a binary c-tree, traverse
the spine of each $h$, and attach the modifiers
$m_1 \prec_h \ldots \prec_h m_K$ in order, we get a strictly ordered
d-tree (also an invertible procedure). □

3.2 Weakly Ordered Dependency Trees 

Next, we relax the strict order assumption, restricting the modifier
sets $M_h = \{m_1, \ldots, m_K\}$ to be only weakly ordered. This means
that we can partition the $K$ modifiers into $J$ equivalence classes,
$M_h = \bigcup_{j=1}^{J} \bar{M}_h^j$, and define a strict order
$\prec_h$ on the quotient set:
$\bar{M}_h^1 \prec_h \ldots \prec_h \bar{M}_h^J$. Intuitively, there is
still a sequence of events (1 to $J$), but now at each event $j$ it may
happen that multiple modifiers (the ones in the equivalence set
$\bar{M}_h^j$) are simultaneously attached to $h$. A weakly ordered
d-tree is a d-tree endowed with a weak order for each head and such
that any pair $m, m'$ in the same equivalence class (written
$m \equiv_h m'$) receive the same dependency label $\ell$.

We now show that Proposition 1 can be generalized to weakly ordered
d-trees.

Proposition 2. Unaryless c-trees and weakly-ordered d-trees are
isomorphic.

Proof. This is a simple extension of Proposition 1. The construction is
the same as in Figure 3, but now we can collapse some of the nodes in
the linked list, so that more than one modifier attaches to the same
position of the spine; this is only possible for sibling arcs with the
same index and the same arc label. Note, however, that if we started
with a c-tree with unary nodes and tried to invert this procedure to
obtain a d-tree, the unary nodes would be lost, since they do not
involve attachment of modifiers. In a chain of unary nodes, only the
last node would be recovered when inverting this transformation. □

We emphasize that Propositions 1-2 hold without blowing up the number
of symbols. That is, the dependency label alphabet is exactly the same
as the set of phrasal symbols in the constituent representations.
Algorithms 1-2 convert back and forth between the two formalisms,
performing the construction of Figure 3. Both algorithms run in linear
time with respect to the size of the sentence.






Algorithm 1 Conversion from c-tree to d-tree
Input: c-tree $\mathcal{C}$.
Output: head-ordered d-tree $\mathcal{D}$.
 1: Nodes := GetPostOrderTraversal($\mathcal{C}$).
 2: Set $j(h) := 1$ for every $h = 1, \ldots, L$.
 3: for $v := \langle Z, h, \mathcal{I} \rangle \in$ Nodes do
 4:   for every $u := \langle X, m, \mathcal{J} \rangle$ which is a child of $v$ do
 5:     if $m \neq h$ then
 6:       Add to $\mathcal{D}$ an arc $\langle h, m, Z \rangle$, and put it in $\bar{M}_h^{j(h)}$.
 7:     end if
 8:   end for
 9:   Set $j(h) := j(h) + 1$.
10: end for
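As a rough illustration of Algorithm 1, the following sketch converts a c-tree given as nested tuples into a list of indexed arcs. The tuple representation, the function name `ctree_to_dtree`, and the decision to skip pre-terminals when advancing the event counter are assumptions made here for the example, not details confirmed by the paper.

```python
from collections import defaultdict


def ctree_to_dtree(root):
    """Convert a lexicalized c-tree into head-ordered d-arcs.

    root is a nested tuple (label, head, children); a pre-terminal has
    an empty children list. Returns arcs (h, m, Z, j), where j is the
    index of the attachment event for head h.
    """
    arcs = []
    event = defaultdict(lambda: 1)      # j(h) := 1 for every h

    def visit(node):                    # post-order: children first
        label, head, children = node
        if not children:
            return                      # assumption: pre-terminals do not count as events
        for child in children:
            visit(child)
        for _, m, _ in children:
            if m != head:               # every non-head child becomes a modifier
                arcs.append((head, m, label, event[head]))
        event[head] += 1                # unconditional, as in line 9

    visit(root)
    return arcs


RB, VBZ, NN = ("RB", 1, []), ("VBZ", 2, []), ("NN", 3, [])
flat = ("VP", 2, [RB, VBZ, NN])
nested = ("VP", 2, [RB, ("VP", 2, [VBZ, NN])])
```

For "really needs caution", the flat tree puts both modifiers in the same event, whereas the nested variant attaches "caution" at event 1 and "really" at event 2. A unary node above the spine adds no arc, so it is silently dropped, in line with Proposition 2.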

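Algorithm 2 Conversion from d-tree to c-tree
Input: head-ordered d-tree $\mathcal{D}$.
Output: c-tree $\mathcal{C}$.
 1: Nodes := GetPostOrderTraversal($\mathcal{D}$).
 2: for $h \in$ Nodes do
 3:   Create $v := \langle p_h, h, \{h\} \rangle$ and set $\psi(h) := v$.
 4:   Sort $M_h(\mathcal{D})$, yielding $\bar{M}_h^1 \prec_h \bar{M}_h^2 \prec_h \ldots \prec_h \bar{M}_h^J$.
 5:   for $j = 1, \ldots, J$ do
 6:     Let $Z$ be the label in $\{\langle h, m, Z \rangle \mid m \in \bar{M}_h^j\}$.
 7:     Obtain c-nodes $\psi(h) = \langle X, h, \mathcal{I} \rangle$ and $\psi(m) = \langle Y_m, m, \mathcal{J}_m \rangle$ for all $m \in \bar{M}_h^j$.
 8:     Add c-node $v := \langle Z, h, \mathcal{I} \cup \bigcup_{m \in \bar{M}_h^j} \mathcal{J}_m \rangle$ to $\mathcal{C}$.
 9:     Set $\psi(h)$ and $\{\psi(m) \mid m \in \bar{M}_h^j\}$ as children of $v$.
10:     Set $\psi(h) := v$.
11:   end for
12: end for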
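The inverse direction can be sketched in the same spirit. The code below is a minimal illustration of Algorithm 2 under our own assumptions: arcs are `(h, m, Z, j)` tuples, c-nodes are plain dictionaries, and children are sorted by their leftmost position purely for display. The names `dtree_to_ctree` and `bracket` are hypothetical.

```python
from collections import defaultdict


def dtree_to_ctree(arcs, pos):
    """Build a unaryless c-tree from head-ordered d-arcs.

    arcs: iterable of (h, m, Z, j); pos: dict position -> POS tag.
    Returns the root c-node as a dict with label, head, yield, children.
    """
    mods = defaultdict(lambda: defaultdict(list))   # h -> j -> [(m, Z)]
    has_head = set()
    for h, m, z, j in arcs:
        mods[h][j].append((m, z))
        has_head.add(m)
    (root,) = [w for w in pos if w not in has_head]  # single-root assumption

    def build(h):
        # Line 3: start the spine of h with its pre-terminal.
        psi = {"label": pos[h], "head": h, "yield": {h}, "children": []}
        for j in sorted(mods[h]):                    # lines 4-5: events in order
            group = mods[h][j]
            labels = {z for _, z in group}
            assert len(labels) == 1, "modifiers of one event must share a label"
            # Recursing here has the same effect as the post-order traversal.
            kids = [psi] + [build(m) for m, _ in group]
            kids.sort(key=lambda n: min(n["yield"]))  # display order only
            psi = {"label": labels.pop(), "head": h,
                   "yield": set().union(*(k["yield"] for k in kids)),
                   "children": kids}                 # lines 8-10
        return psi

    return build(root)


def bracket(node):
    """Render a c-node as a bracketed string of labels."""
    if not node["children"]:
        return node["label"]
    return "(" + node["label"] + " " + " ".join(bracket(c) for c in node["children"]) + ")"


pos_tags = {1: "RB", 2: "VBZ", 3: "NN"}
flat_tree = dtree_to_ctree([(2, 1, "VP", 1), (2, 3, "VP", 1)], pos_tags)
nested_tree = dtree_to_ctree([(2, 3, "VP", 1), (2, 1, "VP", 2)], pos_tags)
```

With the same unlabeled dependencies, the two event assignments give `(VP RB VBZ NN)` and `(VP RB (VP VBZ NN))` respectively, which is the ambiguity that the order information resolves.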

3.3 Continuous and Projective Trees 

What about the more restricted class of projective d-trees? Can we find
an equivalence relation with continuous c-trees? In this section, we
give a precise answer to this question. It turns out that we need an
additional property, illustrated in Figure 4.

We say that $\prec_h$ has the nesting property iff closer words in the
same direction are always attached first, i.e., iff $h < m_i < m_j$ or
$h > m_i > m_j$ implies that either $m_i \equiv_h m_j$ or
$m_i \prec_h m_j$. A weakly-ordered d-tree which is projective and
whose orders $\prec_h$ have the nesting property for every $h$ is
called a nested-weakly ordered projective d-tree. We then have the
following result.

Proposition 3. Continuous unaryless c-trees and nested-weakly ordered
projective d-trees are isomorphic.

Proof. We need to show that (i) Algorithm 1, when applied to a
continuous c-tree $\mathcal{C}$, retrieves a head-ordered d-tree
$\mathcal{D}$ which is projective and has the nesting property, and
(ii) vice-versa for Algorithm 2. To see (i), note that the
projectiveness of $\mathcal{D}$ is ensured by the well-known result of
Gaifman (1965) about the projection of continuous trees. To show that
it satisfies the nesting property, note that nodes higher in the spine
of a word $h$ are always attached by modifiers farther apart (otherwise
edges in $\mathcal{C}$ would cross, which cannot happen for a
continuous $\mathcal{C}$). To prove (ii), we use induction. We need to
show that every c-node created in Algorithm 2 has a contiguous span as
yield. The base case (line 3) is trivial. Therefore, it suffices to
show that in line 8, assuming the yields of (the current) $\psi(h)$ and
each $\psi(m)$ are contiguous spans, the union of these yields is also
contiguous. Consider the node $v$ when these children have been
appended (line 9), and choose $m \in \bar{M}_h^j$ arbitrarily. We only
need to show that for any $d$ between $h$ and $m$, $d$ belongs to the
yield of $v$. Since $\mathcal{D}$ is projective and there is a d-arc
between $h$ and $m$, we have that $d$ must descend from $h$.
Furthermore, since projective trees cannot have crossing edges, we have
that $h$ has a unique child $a$, also between $h$ and $m$, which is an
ancestor of $d$ (or $d$ itself). Since $a$ is between $h$ and $m$, from
the nesting property, we must have
$\langle h, m, \ell \rangle \nprec_h \langle h, a, \ell' \rangle$.
Therefore, since we are processing the modifiers in order, we have that
$\psi(a)$ is already a descendant of $v$ after line 9, which implies
that the yield of $\psi(a)$ (which must include $d$, since $d$ descends
from $a$) must be contained in the yield of $v$. □

Figure 4: Two discontinuous constructions caused by a non-nested order
(top) and a non-projective d-tree (bottom). In both cases node A has a
non-contiguous yield.

Together, Propositions 1-3 have as corollary that nested-strictly
ordered projective d-trees are in a one-to-one correspondence with
binary continuous c-trees. The intuition is simple: if $\prec_h$ has
the nesting property, then, at each point in time, all one needs to
decide about the next event is whether to attach the closest available
modifier on the left or on the right. This corresponds to choosing
between left-branching or right-branching in a c-tree. While this is
potentially interesting for most continuous c-parsers, which work with
binarized c-trees when running the CKY algorithm, our c-parsers (to be
described in §4) do not require any binarization since they work with
weakly-ordered d-trees, using Proposition 2.
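The nesting property of this section is easy to verify for a single head. The sketch below assumes that the event indices of a head's modifiers are available as a dictionary; the function name and this representation are illustrative only. It relies on the observation that the property holds exactly when, on each side of the head, the event indices are non-decreasing as we move away from the head.

```python
def has_nesting_property(h, events):
    """Check the nesting property for head position h.

    events maps each modifier position m of h to its event index j.
    """
    for on_side in (lambda m: m < h, lambda m: m > h):
        # Modifiers on one side, from the closest to the farthest.
        side = sorted((m for m in events if on_side(m)), key=lambda m: abs(m - h))
        indices = [events[m] for m in side]
        # A closer modifier must never be attached after a farther one;
        # comparing neighbours suffices because "<=" is transitive.
        if any(a > b for a, b in zip(indices, indices[1:])):
            return False
    return True
```

For instance, with head 1 and modifiers at positions 2 and 3, the assignment `{2: 1, 3: 1}` is nested, while `{2: 2, 3: 1}` is not, because the farther word would be attached first.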

4 Reduction-Based Constituent Parsers 

We next show how to use the equivalence results obtained in the
previous section to design c-parsers when only a trainable d-parser is
available.

Given a c-treebank provided as input, our procedure is outlined as
follows:

1. Convert the c-treebank to dependencies (Algorithm 1).

2. Train a labeled d-parser on this treebank.

3. For each test sentence, run the labeled d-parser and convert the
predicted d-tree into a c-tree without unary nodes (Algorithm 2).

4. Do post-processing to recover unaries.

The next subsections describe each of these steps. Along the way, we
illustrate with experiments using the English Penn Treebank (Marcus et
al., 1993), which we lexicalized by applying the head rules of Collins
(1999). 2

4.1 Dependency Encoding 

The first step is to convert the c-treebank to head-ordered
dependencies, which we do using Algorithm 1. If the original treebank
has discontinuous c-trees, we end up with non-projective d-trees or
with violations of the nested property, as established in
Proposition 3. We handle this gracefully by training a non-projective
d-parser in the subsequent stage (see §4.2). Note also that this
conversion drops the unary nodes (a consequence of Proposition 2).
These nodes will be recovered in the last stage, as described in §4.4.

Since in this paper we are assuming that only an off-the-shelf d-parser
is available, we need to convert head-ordered d-trees to plain d-trees.
We do so by encoding the order information in the dependency labels. We
tried two different strategies to do that. The first one is a direct
encoding, where we just append suffixes #1, #2, etc., as illustrated in
Figure 1. A disadvantage is that the number of dependency labels may
grow unbounded with the treebank size, since we may encounter complex
substructures where the event sequences are long.

2 We train on §02-21, use §22 for validation, and test on §23. We
predict automatic POS tags with TurboTagger (Martins et al., 2013),
with 10-fold jackknifing on the training set.


The second strategy is a delta-encoding scheme where, rather than
writing the absolute indices in the dependency label, we write the
differences between consecutive indices. 3 We used this strategy for
the continuous treebanks only, whose d-trees are guaranteed to satisfy
the nested property.
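A minimal sketch of the delta encoding follows, applied separately to each side of the head with indices listed from the head outward (the function names are ours). It reproduces the example of footnote 3 and shows that decoding is just a running sum.

```python
from itertools import accumulate


def delta_encode(indices):
    """Replace absolute event indices by differences between consecutive ones.

    indices lists the event indices of the modifiers on one side of the
    head, ordered from the head outward; the first difference is taken
    with respect to 0.
    """
    indices = list(indices)
    return [cur - prev for prev, cur in zip([0] + indices[:-1], indices)]


def delta_decode(deltas):
    """Recover the absolute indices as cumulative sums."""
    return list(accumulate(deltas))


left = delta_encode([1, 3, 4])        # [1, 2, 1]
right = delta_encode([2, 3, 3, 5])    # [2, 1, 0, 2]
```

Under the nested property the indices never decrease when moving away from the head, so all differences are non-negative, which keeps the set of distinct suffixes small.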

For comparison, we implemented a third strategy replicating the
encoding proposed by Hall and Nivre (2008), which we call H&N-encoding.
This scheme concatenates all the c-nodes’ labels in the modifier’s
spine with the attachment position in the head’s spine (for example, in
Figure 3, if the modifier $m_2$ has a spine with nodes
$X_1, X_2, X_3$, the generated d-label would be $X_1|X_2|X_3\#2$; our
direct encoding scheme generates $Z_2\#2$ instead). Since their
strategy encodes the entire spines into complex arc labels, many such
labels will be generated, leading to slower runtimes and poorer
generalization, as we will see.

For the training portion of the English PTB, which contains 27
non-terminal symbols (excluding the POS tags), the direct encoding
strategy yields 75 labels, while delta encoding yields 69 labels
(averaging 2.6 indices per symbol). By contrast, the H&N-encoding
procedure yields 731 labels, more than 10 times as many. We later show
(in Table 1) that delta-encoding leads to a slightly higher c-parsing
accuracy than direct encoding, and that both strategies are
considerably more accurate than H&N-encoding.

4.2 Training the Labeled Dependency Parser 

The next step is to train a labeled d-parser on the 
converted treebank. If we are doing continuous c- 
parsing, we train a projective d-parser; otherwise 
we train a non-projective one. 

In our experiments, we found it advantageous to perform labeled
d-parsing in two stages, as done by McDonald et al. (2006): first,
train an unlabeled d-parser; then, train a dependency labeler. 4
Table 1 compares this approach against a one-shot strategy,
experimenting with various off-the-shelf d-parsers: MaltParser (Nivre
et al., 2007), MSTParser (McDonald et al., 2005), ZPar (Zhang and
Nivre, 2011), and TurboParser (Martins et al., 2013), all with the
default settings. For TurboParser, we used basic, standard and full
models.

3 For example, if #1, #3, #4 and #2, #3, #3, #5 are respectively the
sequences of indices from the head to the left and to the right, we
encode these sequences as #1, #2, #1 and #2, #1, #0, #2 (using 3
distinct indices instead of 5).

4 The reason why a two-stage approach is preferable is that one-shot
d-parsers, for efficiency reasons, use label features parsimoniously.
However, for our reduction approach, the dependency labels are crucial
and strongly interdependent, since they jointly encode the constituent
structure.

Our separate d-labeler receives as input a backbone d-structure and
predicts a label for each arc. For each head $h$, we decode the
modifiers’ labels independently from the other heads, using a simple
sequence model, which contains features of the form
$\phi(h, m, \ell)$ and $\phi(h, m, m', \ell, \ell')$, where $m$ and
$m'$ are two consecutive modifiers (either on the same side or on
opposite sides of the head) and $\ell$ and $\ell'$ are their labels. We
used the same arc label features $\phi(h, m, \ell)$ as TurboParser. For
$\phi(h, m, m', \ell, \ell')$, we use the POS triplet
$\langle p_h, p_m, p_{m'} \rangle$, plus unilexical versions of this
triplet, where each of the three POS is replaced by the word form. Both
features are conjoined with the label pair $\ell$ and $\ell'$. Decoding
under this model can be done by running the Viterbi algorithm
independently for each head. The runtime is almost negligible compared
with the time to parse: it took 2.1 seconds to process PTB §22, a
fraction of about 5% of the total runtime.
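To make the decoding step concrete, here is a generic first-order Viterbi sketch over the modifiers of one head. It is not the authors' labeler: the score tables stand in for the feature-based scores of single labels and of label pairs on consecutive modifiers, and their layout and the function name are assumptions made for this example.

```python
def viterbi_labels(unary, pairwise):
    """Find the best label sequence for the modifiers of one head.

    unary[i][l]: score of label l for the i-th modifier.
    pairwise[i][(lp, l)]: score of labels (lp, l) on modifiers i-1 and i;
    missing pairs score 0. pairwise[0] is unused.
    """
    n = len(unary)
    if n == 0:
        return []
    best = [dict(unary[0])]             # best[i][l]: best score ending in l at i
    back = [{}]                         # back[i][l]: previous label on that path
    for i in range(1, n):
        best.append({})
        back.append({})
        for label, score in unary[i].items():
            prev = max(
                best[i - 1],
                key=lambda p: best[i - 1][p] + pairwise[i].get((p, label), 0.0),
            )
            best[i][label] = best[i - 1][prev] + pairwise[i].get((prev, label), 0.0) + score
            back[i][label] = prev
    # Follow the back-pointers from the best final label.
    path = [max(best[-1], key=best[-1].get)]
    for i in range(n - 1, 0, -1):
        path.append(back[i][path[-1]])
    return path[::-1]


unary_scores = [{"VP#1": 1.0, "VP#2": 0.0}, {"VP#1": 0.5, "VP#2": 0.4}]
pair_scores = [{}, {("VP#1", "VP#2"): 2.0}]
decoded = viterbi_labels(unary_scores, pair_scores)
```

In this toy input the pairwise score makes the second modifier prefer `VP#2` even though its own score favors `VP#1`, which illustrates why the labels of sibling modifiers are decoded jointly rather than one by one.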

4.3 Decoding into Unaryless Constituents 

After training the labeled d-parser, we can run it on the test data.
Then, we need to convert the predicted d-tree into a c-tree without
unaries.

To accomplish this step, we first need to recover, for each head $h$,
the weak order of its modifiers $M_h$. We do this by looking at the
predicted dependency labels, extracting the event indices $j$, and
using them to build and sort the equivalence classes
$\{\bar{M}_h^j\}_{j=1}^{J}$. If two modifiers have the same index $j$,
we force them to have consistent labels (by always choosing the label
of the modifier which is the closest to the head). For continuous
c-parsing, we also decrease the index $j$ of the modifier closer to the
head as much as necessary to make sure that the nesting property holds.
In PTB §22, these corrections were necessary only for 0.6% of the
tokens. Having done this, we use Algorithm 2 to obtain a predicted
c-tree without unary nodes.
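The first part of this step, grouping modifiers by event index and reconciling their labels, might look as follows for the direct encoding. This is a sketch under stated assumptions: labels are strings of the form `Z#j`, the function name is hypothetical, and ties in distance to the head are broken toward the lower position, a detail the text does not specify. The nesting repair for continuous parsing is not included.

```python
from collections import defaultdict


def recover_weak_order(h, labeled_mods):
    """Group the modifiers of head h into ordered equivalence classes.

    labeled_mods maps modifier position -> predicted label such as "NP#2".
    Returns a list of (j, Z, modifiers) sorted by event index j, where Z
    is the label of the modifier closest to the head within the class.
    """
    groups = defaultdict(list)
    for m, predicted in labeled_mods.items():
        symbol, _, index = predicted.rpartition("#")
        groups[int(index)].append((m, symbol))
    order = []
    for j in sorted(groups):
        # Consistency: everyone in the class takes the closest modifier's label.
        closest = min(groups[j], key=lambda ms: (abs(ms[0] - h), ms[0]))
        order.append((j, closest[1], sorted(m for m, _ in groups[j])))
    return order


classes = recover_weak_order(2, {1: "S#1", 4: "VP#1", 3: "NP#2"})
```

In the example, modifiers 1 and 4 share event 1 but disagree on the symbol; modifier 1 is closer to the head at position 2, so the class is labeled `S`.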

4.4 Recovery of Unary Nodes 

Finally, the last stage is to recover the unary nodes. Given a
unaryless c-tree as input, we predict unaries by running independent
multi-class classifiers at each node in the tree (a simple unstructured
task). Each class is either NULL (in which case no unary node is
appended to the current node) or a concatenation of unary node labels
(e.g., S->ADJP for a node JJ); we obtained 64 classes by processing the
training sections of the PTB, the fraction of unary nodes being about
11% of the total number of non-terminal nodes. To reduce complexity,
for each node symbol we only consider classes that have been observed
for that symbol in the training data. In PTB §22, we obtained an
average of 9.9 candidate labels per node occurrence.

These classifiers are trained on the original c-treebank: we strip off
the unary nodes and train the classifiers to recover them. We used the
following features (conjoined with the class and with a flag indicating
if the node is a pre-terminal):

• The production rules above and beneath the 
node (e.g., S->NP VP and NP->DT NN); 

• The node’s label, alone and conjoined with the 
parent’s label or the left/right sibling’s label; 

• The leftmost and rightmost word/lemma/POS 
tag/morpho-syntactic tags in the node’s yield; 

• If the left/right node is a pre-terminal, the 
word/lemma/morpho-syntactic tags beneath. 

This is a relatively easy task: when gold unaryless c-trees are
provided as input, we obtain an EVALB F1-score of 99.43%. This large
figure is explained by the fact that there are few unary nodes in the
gold data, so this module does not impact the final parser as much as
the d-parser. Being a lightweight unstructured task, this step took
only 0.7 seconds to run on PTB §22, representing a tiny fraction (less
than 2%) of the total runtime.

Table 1 reports the accuracies obtained with the d-parser followed by
the unary predictor. Since two-stage TP-Full with delta-encoding is the
best strategy, we use this configuration in the subsequent experiments.

5 Experiments 

We now compare our reduction-based parsers with other state-of-the-art
c-parsers in a variety of treebanks, both continuous and discontinuous.

5.1 Results on the English PTB 

Table 2 shows the accuracies and speeds on the English PTB §23. We can
see that our simple reduction-based c-parser surpasses the three
Stanford parsers (Klein and Manning, 2003; Socher et al., 2013, and
Stanford Shift-Reduce), and is on par with the Berkeley parser (Petrov
and Klein, 2007), while being more than 5 times faster. The best
supervised competitor is the recent shift-reduce parser of Zhu et al.
(2013), which achieves slightly better accuracy and speed. Our
technique has the advantage of being flexible: since the time for
d-parsing is the dominating factor (see §4.4), plugging a faster
d-parser automatically yields a faster c-parser. Orthogonal techniques,
such as semi-supervised training and reranking, can also be applied to
our parser to boost its performance.

Dependency Parser                     UAS     LAS     F1      #Toks/s.
MaltParser                            90.93   88.95   86.87   5,392
MSTParser                             92.17   89.86   87.93     363
ZPar                                  92.93   91.28   89.50   1,022
TP-Basic                              92.13   90.23   87.63   2,585
TP-Standard                           93.55   91.58   90.41   1,658
TP-Full                               93.70   91.70   90.53     959
TP-Full + labeler, H&N encoding       93.80   87.86   89.39     871
TP-Full + labeler, direct encoding    93.80   91.99   90.89     912
TP-Full + labeler, delta encoding     93.80   92.00   90.94     912

Table 1: Results on English PTB §22 achieved by various d-parsers and
encoding strategies. For dependencies, we report unlabeled/labeled
attachment scores (UAS/LAS), excluding punctuation. For constituents,
we show F1-scores (without punctuation and root nodes), as provided by
EVALB (Black et al., 1992). We report total parsing speeds in tokens
per second (including time spent on pruning, decoding, and feature
evaluation), measured on an Intel Xeon processor @2.30GHz.

Parser                           LR     LP     F1     #Toks/s.
Charniak (2000)                  89.5   89.9   89.5   -
Klein and Manning (2003)         85.3   86.5   85.9   143
Petrov and Klein (2007)          90.0   90.3   90.1   169
Carreras et al. (2008)           90.7   91.4   91.1   -
Zhu et al. (2013)                90.3   90.6   90.4   1,290
Stanford Shift-Reduce (2014)     89.1   89.1   89.1   655
Hall et al. (2014)               88.4   88.8   88.6   12
This work                        89.9   90.4   90.2   957
Charniak and Johnson (2005)*     91.2   91.8   91.5   84
Socher et al. (2013)*            89.1   89.7   89.4   70

Table 2: Results on the English PTB §23. All systems reporting runtimes
were run on the same machine. Marked as * are reranking and
semi-supervised c-parsers.

5.2 Results on the SPMRL Datasets 

We experimented with datasets for eight
morphologically rich languages, from the SPMRL14
shared task (Seddah et al., 2014). 5 We used the
official training, development and test sets with the
provided predicted POS tags, and different
lexicalization rules for each language. For French and

5 We left out the Arabic dataset for licensing reasons.


German we used the head rules detailed in
Dybro-Johansen (2004) and Rehbein (2009), respectively.
For Basque, Hungarian and Korean, we always
take the rightmost modifier as head-child node.
For Hebrew and Polish we use the leftmost
modifier instead. For Swedish we induce head rules
from the provided dependency treebank, as
described in Versley (2014b). These choices were
based on dev-set experiments.
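
The leftmost/rightmost lexicalization rules above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in the experiments: the names `HEAD_SIDE`, `head_child_index` and `lexical_head`, as well as the nested `(label, children)` tree format, are hypothetical, and the richer head-rule sets used for French, German and Swedish are not reproduced.

```python
# Hypothetical lookup table built from the prose above. French, German and
# Swedish rely on richer head-rule sets that are not reproduced here.
HEAD_SIDE = {
    "Basque": "rightmost",
    "Hungarian": "rightmost",
    "Korean": "rightmost",
    "Hebrew": "leftmost",
    "Polish": "leftmost",
}


def head_child_index(language, children):
    """Return the index of the head child among a node's children."""
    if not children:
        raise ValueError("a constituent needs at least one child")
    return len(children) - 1 if HEAD_SIDE[language] == "rightmost" else 0


def lexical_head(node, language):
    """Follow head children down to a word.

    A node is either a word (str) or a (label, children) pair.
    """
    while not isinstance(node, str):
        _label, children = node
        node = children[head_child_index(language, children)]
    return node


# Toy tree with placeholder words w1..w5.
tree = ("S", [("NP", ["w1", "w2"]), ("VP", ["w3", ("NP", ["w4", "w5"])])])
print(lexical_head(tree, "Korean"))  # rightmost rule
print(lexical_head(tree, "Polish"))  # leftmost rule
```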

Table 3 shows the results. For all languages
except French, our system outperforms the Berkeley
parser (Petrov and Klein, 2007), with or without
prescribed POS tags. Our average F1-scores are
superior to the best single parser 6 participating in
the shared task (Crabbé and Seddah, 2014), and to
the system of Hall et al. (2014), achieving the best
single-parser results for 4 out of 8 languages.

5.3 Results on the Discontinuous Treebanks 

Finally, we experimented on two widely-used
discontinuous German treebanks: TIGER (Brants et
al., 2002) and NEGRA (Skut et al., 1997). For
the former, we used two different splits:
TIGER-SPMRL, provided in the SPMRL14 shared task;
and TIGER-H&N, used by Hall and Nivre (2008).
For NEGRA, we used the standard splits. In these
experiments, we skipped the unary recovery stage,
since very few unary nodes exist in the data. 7 For
the TIGER-SPMRL dataset, we used the predicted

6 By “single parser” we mean a system which does not use 
ensemble or reranking techniques. 

7 NEGRA has no unaries; for the TIGER-SPMRL and 
H&N dev-sets, the fraction of unaries is 1.45% and 1.01%. 



Parser                      Basque   French   German   Hebrew   Hungar.   Korean   Polish   Swedish   Avg.
Berkeley                    70.50    80.38    78.30    86.96    81.62     71.42    79.23    79.19     78.45
Berkeley Tagged             74.74    79.76    78.28    85.42    85.22     78.56    86.75    80.64     81.17
Hall et al. (2014)          83.39    79.70    78.43    87.18    88.25     80.18    90.66    82.00     83.72
Crabbé and Seddah (2014)    85.35    79.68    77.15    86.19    87.51     79.35    91.60    82.72     83.69
This work                   85.90    78.75    78.66    88.97    88.16     79.28    91.20    82.80     84.22
Björkelund et al. (2014)    88.24    82.53    81.66    89.80    91.72     83.81    90.50    85.50     86.72


Table 3: F1-scores on eight treebanks of the SPMRL14 shared task, computed with the provided
EVALB_SPMRL tool (pauillac.inria.fr/~seddah/evalb_spmrl2013.tar.gz), which
takes into account all tokens except root nodes. Berkeley Tagged is a version of Petrov and Klein (2007)
using the predicted POS tags provided by the organizers. Crabbé and Seddah (2014) is the best
non-reranking system in the shared task, and Björkelund et al. (2014) the ensemble and reranking-based
system which won the official task.


POS tags provided in the shared task. For
TIGER-H&N and NEGRA, we predicted POS tags with
TurboTagger. The treebanks were lexicalized
using the head-rule sets of Rehbein (2009). For
comparison to related work, sentence length cut-offs
of 30, 40 and 70 were applied during the evaluation.

Table 4 shows that our approach outperforms
all the competitors considerably, achieving
state-of-the-art accuracies for both datasets. The best
competitor, van Cranenburgh and Bod (2013), is
more than 3 points behind, both in TIGER-H&N
and in NEGRA. Our reduction-based parsers are
also much faster: van Cranenburgh and Bod
(2013) report 3 hours to parse NEGRA with L <
40. Our system parses all NEGRA sentences
(regardless of length) in 27.1 seconds, which
corresponds to a rate of 618 toks/s. This approaches the
speed of the easy-first system of Versley (2014a),
who reports runtimes in the range 670-920 toks/s.,
but whose system is much less accurate.

6 Related Work 

Conversions between constituents and
dependencies have been considered by De Marneffe et al.
(2006) in the forward direction, and by Collins et
al. (1999) and Xia and Palmer (2001) in the
backward direction, toward the construction of
multi-representational treebanks (Xia et al., 2008). This
prior work aimed at linguistically sound
conversions, involving grammar-specific transformation
rules to handle the kind of ambiguities expressed
in Figure 2. Our work differs in that we are
concerned not with the linguistic plausibility of our
conversions, but only with the formal aspects that
underlie the two representations.

The work most related to ours is Hall and Nivre
(2008), who also convert dependencies to
constituents to prototype a c-parser for German. Their
encoding strategy is compared to ours in §4.1: they
encode the entire spines into the dependency
labels, which become rather complex and
numerous. A similar strategy has been used by
Versley (2014a) for discontinuous c-parsing. Both are
largely outperformed by our system, as shown in
§5.3. The crucial difference is that we encode only
the top node’s label and its position in the spine.
Besides being a much lighter representation, ours
has an interpretation as a weak ordering, leading to
the isomorphisms expressed in Propositions 1-3.
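
To make the lighter encoding concrete, the sketch below converts a small constituent tree into dependency arcs whose labels carry only the constituent label and its position in the head's spine. It is an illustration under assumptions rather than the implementation evaluated here: the function names, the nested `(label, children)` tree format, the 0-based word positions, the `LABEL#k` label syntax and the toy head rule are all hypothetical, and every internal node is assumed to have at least two children (no unary nodes, POS tags omitted).

```python
def to_head_ordered_arcs(tree, pick_head):
    """Convert a constituent tree into labeled dependency arcs.

    tree: a word (str) or a (label, children) pair.
    pick_head(label, children): index of the head child (a head rule).
    Returns (root_position, arcs); each arc is
    (head_position, modifier_position, "LABEL#k"), where k is the position of
    the constituent in the spine of its head word (1 = lowest).
    """
    arcs = []
    next_position = 0

    def visit(node):
        nonlocal next_position
        if isinstance(node, str):
            position = next_position
            next_position += 1
            return position, 0  # head word position, spine height so far
        label, children = node
        visited = [visit(child) for child in children]
        head_index = pick_head(label, children)
        head, height = visited[head_index]
        height += 1  # this constituent extends the spine of its head word
        for index, (modifier, _) in enumerate(visited):
            if index != head_index:
                arcs.append((head, modifier, f"{label}#{height}"))
        return head, height

    root, _ = visit(tree)
    return root, arcs


def toy_head_rule(label, children):
    """Hypothetical head rule: VP is headed by its first child, others by the last."""
    return 0 if label == "VP" else len(children) - 1


tree = ("S", [("NP", ["the", "dog"]), ("VP", ["chased", ("NP", ["a", "cat"])])])
root, arcs = to_head_ordered_arcs(tree, toy_head_rule)
print(root, arcs)
```

In this toy example the verb heads both VP and S, so the object attaches with index 1 and the subject with index 2; the label set stays small because no arc has to spell out a whole spine.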

Joint constituent and dependency parsing has
been tackled by Carreras et al. (2008) and Rush
et al. (2010), but the resulting parsers, while
accurate, are more expensive than a single c-parser.
Very recently, Kong et al. (2015) proposed a much
cheaper pipeline in which d-parsing is performed
first, followed by a c-parser constrained to be
consistent with the predicted d-structure. Our work
differs in that we do not need to run a c-parser
in the second stage. Instead, the d-parser already
stores constituent information in the arc labels,
and the only necessary post-processing is to
recover unary nodes. Another advantage of our
method is that it can be readily used for
discontinuous parsing, while their constrained CKY
algorithm can only produce continuous parses.
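
The claim that no second-stage c-parser is needed can be illustrated with a small decoder that rebuilds constituents directly from labeled arcs. This is a sketch under assumptions, not the system described in this paper: the function name, the `LABEL#k` label syntax, the 0-based positions and the nested-tuple output are hypothetical, unary recovery is left out, and the input is assumed to be continuous (projective), since nested tuples cannot represent discontinuous spans.

```python
from collections import defaultdict


def to_constituents(words, root, arcs):
    """Rebuild a constituent tree from arcs (head, modifier, "LABEL#k").

    Modifiers of a head that share the same index k become siblings under one
    new constituent; constituents are stacked in increasing order of k.
    """
    by_head = defaultdict(lambda: defaultdict(list))
    for head, modifier, tag in arcs:
        label, _, index = tag.rpartition("#")
        by_head[head][int(index)].append((modifier, label))

    def build(head):
        # Returns (leftmost word position, subtree) for the given head word.
        start, node = head, words[head]
        for index in sorted(by_head[head]):
            group = by_head[head][index]
            label = group[0][1]  # arcs sharing an index carry the same label
            parts = [(start, node)] + [build(modifier) for modifier, _ in group]
            parts.sort(key=lambda part: part[0])  # restore left-to-right order
            start, node = parts[0][0], (label, [subtree for _, subtree in parts])
        return start, node

    return build(root)[1]


words = ["the", "dog", "chased", "a", "cat"]
arcs = [(1, 0, "NP#1"), (4, 3, "NP#1"), (2, 4, "VP#1"), (2, 1, "S#2")]
print(to_constituents(words, 2, arcs))
```

The decoder makes a single pass over the arcs and one recursive pass over the words, which is consistent with d-parsing being the dominant cost of the pipeline.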

7 Conclusion 

We proposed a reduction technique that allows one to
implement a constituent parser when only a
dependency parser is given. The technique is
applicable to any dependency parser, regardless of its
nature or kind. This reduction was accomplished
by endowing dependency trees with a weak
order relation, and showing that the resulting class



TIGER-SPMRL
                                          L < 40           L < 70           all
Parser                                    F1      EX       F1      EX       F1      EX
Versley (2014b), gold                     78.34   42.78    76.46   41.05    76.11   40.94
This work, gold                           82.56   45.25    80.98   43.44    80.62   43.32
Versley (2014b), pred                     -       -        73.90   37.00    -       -
This work, pred                           79.57   40.39    77.72   38.75    77.32   38.64

TIGER-H&N
                                          L < 30           L < 40           all
Parser                                    F1      EX       F1      EX       F1      EX
Hall and Nivre (2008), gold               -       -        79.93   37.78    -       -
Versley (2014a), gold                     76.47   40.61    74.23   37.32    -       -
This work, gold                           86.63   54.88    85.53   51.21    84.22   49.63
Hall and Nivre (2008), pred               -       -        75.33   32.63    -       -
van Cranenburgh and Bod (2013), pred      -       -        78.8-   40.8-    -       -
This work, pred                           83.94   49.54    82.57   45.93    81.12   44.48

NEGRA
                                          L < 30           L < 40           all
Parser                                    F1      EX       F1      EX       F1      EX
Maier et al. (2012), gold                 74.5-   -        -       -        -       -
van Cranenburgh (2012), gold              -       -        72.33   33.16    71.08   32.10
Kallmeyer and Maier (2013), gold          75.75   -        -       -        -       -
van Cranenburgh and Bod (2013), gold      -       -        76.8-   40.5-    -       -
This work, gold                           82.56   52.13    81.08   48.04    80.52   46.70
van Cranenburgh and Bod (2013), pred      -       -        74.8-   38.7-    -       -
This work, pred                           79.63   48.43    77.93   44.83    76.95   43.50


Table 4: Results on TIGER and NEGRA test partitions, with gold and predicted POS tags. Shown are F1
and exact match scores (EX), computed with the Disco-DOP evaluator (discodop.readthedocs.org),
ignoring root nodes and, for TIGER-H&N and NEGRA, punctuation tokens.



of head-ordered dependency trees is isomorphic
to constituent trees. We have shown empirically
that the proposed reduction, while simple, leads to
highly competitive constituent parsers for English
and for eight morphologically rich languages, and
that it outperforms the current state of the art in
discontinuous parsing of German.